Add optional reward scaling #95
Conversation
Looks good. Just needs more comments.
```
@@ -6,7 +6,7 @@ model:
train:
  seq_length: 48  # Size of LM context
  epochs: 1000  # Train for max(epochs, total_steps)
```
Thank god we finally decreased this haha
```
scores = torch.as_tensor(self.score(texts), device=samples.device)
stats["exp_score_time"] = time() - exp_score_time

if self.ref_mean is None:
```
Some comments here about what this does would be helpful :)
```
delta = xs_mean - self.mean
tot_count = self.count + xs_count

m_a = self.var * self.count
```
I hate having lots of math with no comments giving an intuitive explanation of what it's doing. Please fix.
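For intuition, the snippet implements Chan et al.'s parallel variance formula: merge a running (mean, var, count) with a new batch's statistics. A commented sketch (names follow the diff; the full class body is assumed, not copied from the PR):

```python
import torch

class RunningMoments:
    """Running mean/variance over batches via Chan et al.'s parallel
    variance formula. A sketch for intuition, not the PR's exact code."""

    def __init__(self):
        self.mean = 0.0
        self.var = 1.0
        self.count = 1e-24  # tiny prior count avoids division by zero

    def update(self, xs: torch.Tensor):
        xs_count = xs.numel()
        xs_var, xs_mean = torch.var_mean(xs, unbiased=False)

        delta = xs_mean - self.mean        # gap between batch mean and running mean
        tot_count = self.count + xs_count  # total samples after merging

        m_a = self.var * self.count        # sum of squared deviations seen so far
        m_b = xs_var * xs_count            # sum of squared deviations in this batch
        # cross term corrects for the two groups being centered on different means
        m_2 = m_a + m_b + delta**2 * self.count * xs_count / tot_count

        self.mean += delta * xs_count / tot_count  # pull the running mean toward the batch
        self.var = m_2 / tot_count                 # merged (biased) variance
        self.count = tot_count
```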
Can you post the W&B link to a run where we can confirm rescaling works? Thanks.
Agreed, let's make sure to do this on all algo-changing PRs.
Run looks good to me. Add comments and we can merge!
```diff
@@ -35,6 +35,8 @@ method:
   cliprange: 0.2  # clip range
   cliprange_value: 0.2  # clip range
   vf_coef: 2.3  # value term weight
+  scale_reward: True
+  clip_reward: 10
```
Shouldn't reward generally be in the range [-1,1]?
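For context on how these two options might be consumed downstream, a hypothetical sketch (the actual call site isn't shown in this hunk): scale by a running std, then clamp outliers so a single extreme score can't dominate the PPO update.

```python
import torch

def scale_and_clip(scores: torch.Tensor, running_std: float, clip_reward: float = 10.0) -> torch.Tensor:
    # scale_reward: divide by a running estimate of the reward std
    scores = scores / (running_std + 1e-8)
    # clip_reward: clamp outliers into [-clip_reward, clip_reward]
    return torch.clamp(scores, -clip_reward, clip_reward)
```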
stats["exp_score_time"] = time() - exp_score_time | ||
|
||
if self.ref_mean is None: | ||
self.ref_mean, self.ref_std = scores.mean(), scores.std() |
Perhaps naming this ref_mean is a bit misleading? It is not the mean of the reference model but rather the mean of the training model.
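To make the pattern being discussed concrete: the first batch's statistics are cached as a baseline, and later batches can be standardized against them. A hypothetical standalone helper (the PR applies this inline in the orchestrator, with different surrounding code):

```python
import torch

class ScoreNormalizer:
    """Hypothetical helper mirroring the pattern above: the first batch
    of scores defines the baseline named ref_mean / ref_std."""

    def __init__(self):
        self.ref_mean = None
        self.ref_std = None

    def __call__(self, scores: torch.Tensor) -> torch.Tensor:
        if self.ref_mean is None:
            # first batch seen becomes the baseline distribution
            self.ref_mean, self.ref_std = scores.mean(), scores.std()
        return (scores - self.ref_mean) / (self.ref_std + 1e-8)
```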
```
def whiten(xs: torch.Tensor, shift_mean=True, distributed=True) -> torch.Tensor:
    """Whitens values"""
```
Out of curiosity, do we have a reference for whitening? (A blog post, an arXiv paper?)
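For what whitening does here: standardize a tensor to zero mean and unit variance (the name comes from whitening a signal's spectrum). A single-process sketch consistent with that signature; the distributed branch is omitted and the body is assumed, not the PR's exact code:

```python
import torch

def whiten(xs: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    """Standardize xs to zero mean and unit variance."""
    var, mean = torch.var_mean(xs)
    whitened = (xs - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        # rescale the spread but keep the original mean
        whitened += mean
    return whitened
```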
```
)


class RunningMoments:
```
Very nice
```
class RunningMoments:
    def __init__(self):
```
If we precompute the mean and var of our initial reward distribution ahead of time, do we have a way of incorporating that?
Approved, though I'd like to know if a precomputed mean and var of baseline rewards can be used as well.
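On the precomputed-stats question: since the class only stores (mean, var, count), one way to incorporate a baseline (hypothetical usage, building on the RunningMoments sketch above) is to seed those fields directly, with the count acting as the weight of the prior against incoming batches:

```python
moments = RunningMoments()
# Seed with precomputed statistics of the initial reward distribution.
# `baseline_count` controls how strongly the prior resists new batches.
moments.mean, moments.var, moments.count = baseline_mean, baseline_var, baseline_count
```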
In the spirit of #48:
- reward scaling from https://github.com/DLR-RM/stable-baselines3/blob/d5d1a02c15cdce868c72bbc94913e66fdd2efd3a/stable_baselines3/common/vec_env/vec_normalize.py#L220
- minibatch whitening from https://github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/ppo/ppo.py#L80

W&B report: https://wandb.ai/sorry/public/reports/mean_reward-22-11-17-01-59-30---VmlldzoyOTg0ODc3